ABSTRACT

We investigate the complexity of searching a table of n elements by
comparisons on a synchronous, shared memory parallel computer with p processors.
We show that $\Theta(\lg n)$ steps are required if concurrent accesses to the same
memory cell are not allowed, whereas only $\Theta(\lg n/\lg p)$ steps are required if
simultaneous reads are allowed. We next show that these lower bounds remain valid
even if only communication steps are counted, and even if polynomial comparisons
are allowed.

1.  INTRODUCTION

There has recently been increasing interest in the study of tightly
coupled parallel machines. Parallel processors have been around for some time,
and with advances in microelectronics it is becoming feasible to build parallel
machines with thousands of cooperating processors. One possible organization for
such  machines  is  a  shared  memory  architecture:  many  independent  processors  with 
access  to  a  common  memory.  While  the  problem  of  providing  shared  access  to  a 
common  memory  is  significant,  such  architecture  seems  suitable  for  a  general 
purpose  computer,  where  the  pattern  of  memory  accesses  is  not  known  in  advance. 
Also,  it  seems  feasible  to  build  a  shared  memory  parallel  machine  with  thousands 
of  processors  [Go]. 

Many  authors  have  studied  algorithms  for  shared  memory  parallel  machines, 
especially  numerical  ones  (see  [Sc]  for  a  comprehensive  bibliography).  The 
models  used  by  the  authors  are  not  always  identical,  and  differ  mainly  in  the 
assumptions  made  about  concurrent  access  to  memory.  Most  authors  assume  that  a 
memory  register  can  be  accessed  concurrently  by  more  than  one  processor,  but  that 
exclusive  access  is  required  in  order  to  modify  its  content.  Some  authors  have 
also  studied  models  where  concurrent  accesses  are  not  supported  [Pr],  [LPV],  or 
where concurrent writes are allowed [SV], [Kr1]. See also [BH].

There  are  very  few  nontrivial  lower  bounds  for  parallel  computations.  The 
standard  argument  used  in  proving  such  lower  bounds  is  that  of  data  dependency  in 
a  straight-line  program.  In  real  applications  computation  time  is  often 
dominated  by  the  overhead  of  coordinating  a  large  number  of  processors,  and 
algorithms  where  the  control  flow  is  highly  dependent  on  the  input  data  are  hard 
to parallelize. In this paper we use a control dependency lower bound argument
for parallel searching. We show that a list of n items can be searched in
parallel with p processors in time $O(\lg n/\lg p)$, but that if concurrent access to a
memory location is not allowed, then $\Omega(\lg n)$ steps are needed, as the time spent
in coordinating the processors offsets exactly the gain obtained from
simultaneous table look-ups.

For  a  model  of  parallel  computations  where  concurrent  access  to  the  same 
memory  location  is  not  allowed,  searching  provides  an  example  of  a  problem  which 
is  "completely  unparallelizable"  [Sc]:  No  speedup  is  achieved  by  parallelism.  A 
similar  example  was  recently  given  by  Kruskal  in  [Kr2]  where  a  particular  case  of 
searching  a  linked  tree  in  parallel  is  examined. 

The $\Omega(\lg n)$ lower bound on search implies a similar lower bound for the
insertion problem on a shared memory parallel machine with no concurrent access
to the same memory location. This settles an open problem posed by Borodin in
[Bo].

This  result  also  settles  the  problem  of  the  relative  power  of  the  different 
shared-memory  machine  models.  It  is  known  that  a  p-processor  machine  which 
supports  concurrent  accesses  to  the  same  location  in  memory  can  be  simulated  by  a 
p-processor  machine  with  no  concurrent  accesses  to  the  same  location  in  memory 
with  a  lg(p)  time  penalty  [Ec].  Our  result  shows  that  this  simulation  is 
optimal.


The remainder of this paper is organized as follows. Models of parallel
machines are introduced in the next section. Searching on a parallel machine
that supports concurrent reads is examined in section 3, and the main result
concerning searching on a parallel machine that does not support concurrent
access to the same memory location is presented in section 4. Algorithms that use
generalized comparisons are considered in section 5. In section 6 we consider
parallel machines with local and shared memory, where only accesses to shared
memory are accounted for. In section 7 we extend the lower bounds proven in
section 6 to algorithms that use polynomial comparisons. Conclusions and open
problems are presented in the last section.


2.  PARALLEL  COMPUTATION  MODELS 

We consider parallel machines that consist of autonomous serial processors
sharing a central memory. Each processor can execute basic primitive operations
such as comparisons or arithmetic operations in a single step. Such an operation
may involve fetching the operands from shared memory and/or storing the result in
shared memory. Each processor independently executes its own program, in
lockstep with the other processors. We obtain different models by varying the
assumptions concerning simultaneous accesses to shared memory:

1.  Exclusive  Read,  Exclusive  Write  (EREW)  -  Simultaneous  accesses  to  the  same 
register  are  not  allowed. 

2.  Concurrent  Read,  Exclusive  Write  (CREW)  -  Simultaneous  reads  are  allowed 
but  a  processor  can  modify  a  memory  register  only  if  it  has  exclusive 
access  to  it. 

3.  Concurrent Read, Concurrent Write (CRCW) - Both simultaneous reads and
simultaneous writes are allowed. The effect of simultaneous actions by the
processors is as if the actions occurred in some serial order (for other
possible definitions of CRCW, see [BH]).

Note  that  these  models  correspond  to  the  concept  of  a  shared  memory  MIMD 
(multiple  instruction  stream,  multiple  data  stream)  machine.  A  more  restrictive 
model  is  obtained  by  assuming  that  all  the  processors  execute  the  same 
instruction  stream  (SIMD  model).  However,  SIMD  machines  that  are  able  to  combine 
in  one  step  the  results  of  tests  performed  at  each  processor  and  to  branch 
according  to  the  value  of  this  combination  may  be  more  powerful  than  MIMD 
machines  in  certain  applications. 

Another  model  of  parallelism  often  used  is  that  of  a  fixed-geometry  parallel 
machine.  It  consists  of  a  network  of  synchronized  sequential  processors.  Each 
processor  can  communicate  in  one  step  to  those  processors  to  which  it  is  directly 
connected.  We  can  represent  a  unidirectional  communication  link  by  a  register 
which  can  be  written  upon  by  one  processor  (the  link  input)  and  can  be  read  by 
another  processor  (the  link  output).  Thus,  a  fixed-geometry  parallel  machine  can 
be  viewed  as  a  restricted  EREW  shared-memory  parallel  machine,  where  for  each 
register  in  memory  there  is  a  unique  processor  that  has  read  access  to  it,  and  a 
unique,  possibly  distinct,  processor  that  has  write  access  to  it.  In  particular, 
any  lower  bound  proven  for  computations  on  the  EREW  model  of  a  shared-memory 
parallel  machine  is  valid  for  any  fixed-geometry  parallel  machine  with  the  same 
number  of  processors. 

We now restrict this general framework to parallel comparison-based
algorithms. A decision problem $\Pi$ with n inputs is defined to be a function that
assigns to each n-tuple of numbers $x_1,\ldots,x_n$ in a set $S \subseteq R^n$ one of finitely many
possible outcomes. Formally, a decision problem is a partial function from $R^n$ to
a finite set $O$. For example, the sorting problem for n inputs associates with
each n-tuple of distinct numbers $x_1,\ldots,x_n$ the permutation $\sigma \in S_n$ that fulfils
the condition $x_{\sigma(1)} < x_{\sigma(2)} < \cdots < x_{\sigma(n)}$.

The usual computational model used to analyse the complexity of decision
problems is that of a comparison-based algorithm. In such an algorithm inputs are
manipulated as atomic entities, and the only operations performed on them are
comparisons and moves. The control flow of such an algorithm can be represented by
a  decision  tree.  The  decision  tree  model  is  not  adequate  for  parallel  algorithms 
as  it  ignores  the  overhead  of  coordinating  processors  and  of  replicating  data  on 
a  machine  that  does  not  support  concurrent  reads.  Instead,  we  shall  use  a 
parallel  machine  model  where  only  comparisons  on  inputs,  data  moves,  and 
interprocessor  communication  through  shared  memory  are  allowed. 

A Parallel Test and Branch Machine (PTBM) consists of p > 1 processors, a
sequence $D_1,\ldots,D_r$ of shared data registers, each one of which can store a number
(a key), and a sequence $C_1,\ldots,C_s$ of shared control registers, each one of which
can store a value from a finite set CV of control values. The instructions that
can be executed by a processor are of one of the following forms:

1.  READ instructions:     READ $C_j$, where $1 \le j \le s$.

2.  WRITE instructions:    $C_j$ := c, where $1 \le j \le s$, and $c \in CV$.

3.  MOVE instructions:     $D_i$ := $D_j$, where $1 \le i,j \le r$.

4.  TEST instructions:     $D_i$ : $D_j$, where $1 \le i,j \le r$.

Let I be the set of instructions. Each processor is a finite-state machine
described by a tuple $\langle S, s_0, S_f, F, T_R, T_W, T_M, T_T \rangle$, where
$S$ is the finite set of states;
$s_0$ is the initial state;
$S_f$ is the set of final states;

$F: S \setminus S_f \to I$ associates with each non-final state an instruction;

Let $S_R$ (respectively $S_W$, $S_M$ and $S_T$) be the set of states associated with READ
(respectively WRITE, MOVE and TEST) instructions. Then $T_R$ ($T_W$, $T_M$ and $T_T$) are the
transition functions associated with each of these four sets of states. Thus,

$T_R: S_R \times CV \to S$,
$T_W: S_W \to S$,
$T_M: S_M \to S$, and
$T_T: S_T \times \{0,1\} \to S$.

A  computation  of  a  PTBM  starts  with  the  first  n  data  registers  storing  the 
inputs to the computation, the remaining data registers initialized to $-\infty$, and
the  control  registers  initialized  to  the  initial  value  0.  At  each  step  each 
processor  executes  the  instruction  associated  with  its  current  state,  and  changes 
state  according  to  its  transition  function.  If  the  instruction  executed  is  a 
READ  then  the  next  state  depends  on  the  control  value  read  and  the  current  state; 
if  the  instruction  is  a  TEST  then  the  next  state  depends  on  the  outcome  of  the 
test  and  the  current  state  (each  test  is  a  binary  comparison  that  can  have  two 
distinct  outcomes);  if  the  current  instruction  is  a  WRITE  or  MOVE  then  the  next 
state  is  determined  by  the  current  state. 


Access  to  shared  variables  is  governed  by  one  of  the  methods  mentioned  at 
the  start  of  this  section.  In  an  EREW-PTBM  an  error  condition  occurs  whenever 
two  processors  attempt  to  access  the  same  register  simultaneously.  In  a 
CREW-PTBM  an  error  condition  occurs  if  a  processor  attempts  to  modify  a  register 
that  is  simultaneously  accessed  by  another  processor.  Finally,  in  a  CRCW-PTBM 
there  are  no  restrictions  on  concurrent  access  to  shared  registers. 
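
The three access rules can be restated as a check on the set of accesses issued in a single step. The following Python sketch is only an illustration; the function name `conflicts` and the `(processor, register, kind)` encoding of an access are assumptions made here and are not part of the PTBM definition.

```python
from collections import defaultdict

def conflicts(accesses, model):
    """Return the registers whose accesses in one step violate `model`.

    accesses: iterable of (processor, register, kind), kind in {"r", "w"}.
    model: "EREW", "CREW" or "CRCW".
    """
    by_register = defaultdict(list)
    for _proc, reg, kind in accesses:
        by_register[reg].append(kind)

    bad = set()
    for reg, kinds in by_register.items():
        if len(kinds) < 2:
            continue  # a single access never conflicts
        if model == "EREW":
            bad.add(reg)  # any simultaneous access is an error
        elif model == "CREW" and "w" in kinds:
            bad.add(reg)  # a write requires exclusive access
        # CRCW: no restriction on concurrent access
    return bad
```

For instance, two processors reading the same control register in the same step form an error condition under EREW but not under CREW or CRCW.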

A computation of a PTBM terminates when each processor has reached a final
state (a processor that has reached a final state stays in this state at
subsequent steps). Let $\Phi = S_f^1 \times \cdots \times S_f^p$ be the set of p-tuples of final states of
the p processors of a PTBM ($\Phi$ is the set of final states of the PTBM). We can
associate with the machine a partial function $F^*: R^n \to \Phi$ that maps each tuple of
inputs into the tuple of final states reached by the p processors, if the
computation terminates normally.

A PTBM with p processors weakly solves the decision problem $\Pi: R^n \to O$ in
time T if the following two conditions are fulfilled:

1.  Any computation with inputs in the domain of $\Pi$ terminates normally after no
more than T steps.

2.  Each final state $\phi \in \Phi$ can be associated with an outcome $g(\phi) \in O$ so that
$g \circ F^* = \Pi$.

Note  that  this  is  a  very  weak  requirement:  We  require  that  the  final  state 
of  the  system  at  the  end  of  the  computation  encodes  the  answer  to  the  problem, 
but  we  do  not  require  that  this  answer  be  available  at  any  unique  processor 
(thus,  if  the  answer  is  a  number,  we  are  content  if  each  processor  has  computed 
one  bit  of  the  number,  and  do  not  require  that  the  number  be  available  at  any  one 
processor). A stronger and more natural requirement is that at termination time
a fixed processor holds the answer to the problem. Formally, a PTBM with p
processors strongly solves the problem $\Pi$ in time T if the following two
conditions are satisfied.

1.  Any computation with inputs from the domain of $\Pi$ terminates normally after
no more than T steps.

2.  Each final state $s \in S_f^1$ of $P_1$ can be associated with an outcome $g(s) \in O$ so
that $g \circ F^* = \Pi$.

It  is  clearly  possible  to  collect  at  one  processor  complete  information 
about  the  state  of  the  system  in  lg(p)  WRITE/READ  cycles,  without  accessing 
concurrently  the  same  location  in  memory.   We  thus  have 

LEMMA 2.1 If a problem $\Pi$ can be weakly solved on an EREW-PTBM (CREW-PTBM,
CRCW-PTBM) with p processors in time T then it can be strongly solved on an
EREW-PTBM (CREW-PTBM, CRCW-PTBM) with p processors in time T + 2 lg(p).

It is easy to show that this result cannot be strengthened in general.
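
The collection step behind Lemma 2.1 can be pictured as a pairwise merge along a binary tree. The Python sketch below is a sequential simulation under stated assumptions: the function name `gather`, the per-processor mailboxes, and the use of dictionaries to hold the accumulated information are illustrative choices, not constructions taken from the text. Each round is one WRITE step followed by one READ step, and every mailbox is touched by a single processor in each step.

```python
def gather(states):
    """Collect all processor states at processor 0 by pairwise merging.

    Returns (information held by processor 0, number of steps used).
    """
    p = len(states)
    known = [{i: s} for i, s in enumerate(states)]  # what each processor knows
    steps = 0
    stride = 1
    while stride < p:
        mailbox = {}
        # WRITE step: each sender posts everything it knows in its own mailbox.
        for i in range(stride, p, 2 * stride):
            mailbox[i] = dict(known[i])
        # READ step: receiver i reads only the mailbox of processor i + stride.
        for i in range(0, p, 2 * stride):
            if i + stride in mailbox:
                known[i].update(mailbox[i + stride])
        steps += 2
        stride *= 2
    return known[0], steps
```

With p = 8 the sketch uses three rounds, that is 2 lg(p) = 6 steps, in agreement with the additive term of the lemma.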

The  PTBM  model  can  be  extended  to  allow  for  more  general  comparisons,  by 
extending  the  set  of  available  test  instructions.  Note  that  the  shared  memory  of 
the  machine  is  used  for  two  distinct  purposes:  To  share  data  among  the 
processors,  and  to  coordinate  their  activity.  The  distinction  between  data 
registers  and  control  registers  ensures  that  control  information  is  created  only 
through  tests.  It  is  still  the  case  that  control  information  can  be  obtained 
from  a  data  register:  The  current  value  of  a  data  register  can  carry  the 
information  that   some   processor  performed  a   data  move  instruction  with  that 
particular register as a target. The following lemma shows, however, that this
can be avoided at the cost of a constant factor increase in running time.

LEMMA 2.2 Suppose that the problem $\Pi$ can be weakly (strongly) solved by an
EREW-PTBM (CREW-PTBM, CRCW-PTBM) with p processors in time T. Then $\Pi$ can be
weakly (strongly) solved in time 3T by an EREW-PTBM (CREW-PTBM, CRCW-PTBM) with p
processors that satisfies the following condition:

When a processor accesses a data register then the state of the processor
uniquely identifies the identity of the input stored in that register (the
processor "knows in advance" the identity of each input that it accesses).

PROOF: We shall show informally how to simulate each step of a PTBM M by three
steps of a PTBM M' that fulfils the above condition. For each data register D of M a
control register $C_D$ is set in M' which stores the index of the data item
contained in D (i.e. the index of the data register which initially contains
this item), or 0 if D has not yet been modified.

Each TEST instruction of the form $D_i$ : $D_j$ which is executed by M is simulated on
M' by the sequence of instructions READ $C_{D_i}$, READ $C_{D_j}$, and $D_i$ : $D_j$. The
processor executing these instructions also "stores" in its internal state the
values read from $C_{D_i}$ and $C_{D_j}$ until the test is done.

Each MOVE instruction of the form $D_i$ := $D_j$ is simulated by the sequence of
instructions READ $C_{D_j}$, $C_{D_i}$ := c, and $D_i$ := $D_j$, where c is the value returned by
the first read if that value is not 0, or j if that value is 0. The processor
executing these instructions "stores" in its internal state the value returned by
the first READ until the move is executed.

Any other instruction is simulated on M' by a sequence of three instructions
consisting of two "noop" instructions followed by the simulated instruction (this
ensures a correct synchronization of the simulation).

We leave to the reader the details of the definition of the simulating
processors. The simulating machine M' fulfils the required condition,
conflicting reads or writes occur on M' only if they occurred on M, and if M halts
after T steps then M' halts after 3T steps in the same configuration.
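
The bookkeeping used in this simulation can be sketched for a single processor as follows. The class name `TaggedMemory` and its methods are hypothetical names chosen for illustration, and the sketch ignores the finite-state encoding of the simulating processors that the proof leaves to the reader.

```python
class TaggedMemory:
    """Data registers D_1..D_r, each with a tag register (the C_D of the proof)."""

    def __init__(self, inputs, r):
        n = len(inputs)
        # Index 0 is unused so that list indices match D_1..D_r.
        self.data = [None] + list(inputs) + [float("-inf")] * (r - n)
        self.tag = [0] * (r + 1)  # 0 means "not yet modified"
        self.steps = 0

    def move(self, i, j):
        """Simulate D_i := D_j in three steps."""
        c = self.tag[j] or j          # step 1: READ the tag of D_j
        self.tag[i] = c               # step 2: WRITE the tag of D_i
        self.data[i] = self.data[j]   # step 3: the MOVE itself
        self.steps += 3

    def test(self, i, j):
        """Simulate the test D_i : D_j in three steps.

        Returns the identities of the two operands, known before the
        comparison is made, and the outcome of the comparison.
        """
        a = self.tag[i] or i          # step 1: READ the tag of D_i
        b = self.tag[j] or j          # step 2: READ the tag of D_j
        self.steps += 3
        return (a, b), self.data[i] < self.data[j]  # step 3: the TEST
```

After `move(4, 2)` and `move(5, 4)`, a test of register 5 against register 1 reports the operand identities (2, 1): the processor knows it is comparing the second and first inputs before the comparison is performed.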


Note that if a CREW-PTBM or CRCW-PTBM satisfies the condition of Lemma 2.2
then it is possible to avoid data move instructions altogether: Since a processor
"knows" the identity of an input stored in a data register before it accesses it,
the processor can access the data register which contains the original copy of
this input. The same holds true on an EREW-PTBM if the initial inputs are
duplicated to allow conflict-free access to them by all the processors. However,
if each input is initially available once, it still might be necessary to create
multiple copies of the data that different processors of the EREW-PTBM can
access simultaneously.


3.  PARALLEL  SEARCHING  IN  THE  CREW  MODEL 

The search problem has several variants, all of which have essentially the
same complexity. For the sake of simplicity we shall consider the following version.

RANGE SEARCH PROBLEM (for a table of size n): Given n+1 distinct inputs
$x_1,\ldots,x_n,y$ such that $x_1 < x_2 < \cdots < x_n$, find the index i such that $x_i < y < x_{i+1}$
(by definition $x_0 = -\infty$ and $x_{n+1} = \infty$).

We first determine the complexity of searching in the CRCW and CREW models
of parallelism.

THEOREM 3.1 The range search problem for a table of size n can be (strongly)
solved on a CREW-PTBM with p > 1 processors in time $O(\lg(n)/\lg(p))$.

PROOF: The algorithm used is the obvious extension of binary search to p
processors, namely p-ary search: p-1 keys are chosen that divide the list of keys
into p intervals of roughly equal length. These keys are compared in parallel to
the searched key. The comparisons locate the searched key within one of the
subintervals, and the search proceeds recursively within this subinterval. At
each iteration the length of the list is decreased by a factor of p, so that the
search ends in lg(n)/lg(p) stages. It remains to be shown that each iteration
can be implemented in constant time without concurrent writes.

Note that the searched key is located in the i-th subinterval iff the
outcomes of the comparisons made at the (i-1)-th processor and at the i-th processor
are different. Each processor can in constant time match the outcome of the
comparison it performed against the outcome of the comparison performed by its
left neighbour. The unique processor that detects the subinterval containing the
searched key then updates the shared information on the search interval. All the
processors next read this information, and proceed to search within the new
interval. The complete algorithm is given below.

SEARCH(Y,X,BOT,TOP)

/*  Search for key Y in sorted list X[BOT],..,X[TOP].
    We assume that X[BOT] < Y < X[TOP].
    Initially BOT = 0, TOP = n+1,
    X[BOT] = -∞, X[TOP] = +∞  */

CONS  P;  /*  Number of processors  */

VAR   C[0..P];
/*  Vector storing outcomes of comparisons;
    initially C[0] = 0, C[P] = 1  */

BEGIN
  WHILE (TOP > BOT+1) DO
    FOR J IN [1..(P-1)] PARDO
      VAR I;
      /*  Compute index of key to be compared  */
      I := BOT + J*(TOP-BOT)/P;
      /*  Compare and store outcome  */
      C[J] := IF (Y > X[I]) THEN 0 ELSE 1
    ODRAP;
    FOR J IN [1..P] PARDO
      /*  Compute next interval; the assignment is done by a unique
          processor. The new TOP is recomputed from J, since the
          local I of the first loop is not defined for J = P.  */
      IF (C[J-1] < C[J]) THEN
        <BOT,TOP> := <BOT+(J-1)*(TOP-BOT)/P, BOT+J*(TOP-BOT)/P>
    ODRAP;
  OD;
  RETURN(BOT)
END


It is easy to see that, for any fixed table size, this algorithm can be
implemented on a CREW-PTBM with p processors, such that each iteration takes a
constant number of steps.
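
To make the iteration count concrete, the p-ary search can be simulated sequentially. The Python sketch below is a minimal illustration, not a PTBM program: the name `pary_search` and its return convention are assumptions, the two parallel phases are replaced by ordinary list comprehensions, integer division is used for the split points, and the new upper end of the interval is recomputed from J.

```python
import math

def pary_search(x, y, p):
    """Range search for y in the sorted list x, simulating p processors.

    Returns (i, rounds): i is the number of keys smaller than y, i.e. the
    index i of the problem statement with x_0 = -inf, and rounds is the
    number of iterations of the main loop. Assumes y differs from all keys.
    """
    assert p >= 2
    keys = [-math.inf] + list(x) + [math.inf]
    bot, top = 0, len(x) + 1
    rounds = 0
    while top > bot + 1:
        # Split points for J = 0..p; J = 0 and J = p are the interval ends.
        cut = [bot + j * (top - bot) // p for j in range(p + 1)]
        # First parallel phase: processor J compares y with keys[cut[J]].
        c = [0] + [0 if y > keys[cut[j]] else 1 for j in range(1, p)] + [1]
        # Second parallel phase: exactly one J sees the outcomes change.
        hits = [j for j in range(1, p + 1) if c[j - 1] < c[j]]
        assert len(hits) == 1
        j = hits[0]
        bot, top = cut[j - 1], cut[j]
        rounds += 1
    return bot, rounds
```

For the table 10, 20, ..., 70 with y = 45 and p = 4, the interval shrinks from [0,8] to [4,6] and then to [4,5], so the sketch returns index 4 after two rounds.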


Note  that  the  full  power  of  concurrent  reads  is  not  needed  to  implement  this 
algorithm.  It  is  sufficient  to  have  a  shared  memory  machine  with  a  broadcasting 
ability:  One  (fixed)  processor  is  able  to  broadcast  in  constant  time  one  item  of 
information  to  all  the  other  processors. 

Parallel search in the continuous case was studied by Gal and Miranker in
[GM], where a parallel version of the bisection algorithm for root finding is
given (the problem of processor coordination is ignored). An adversary argument
shows that the policy of splitting at each stage the interval of possible values
into equal length subintervals is optimal. A similar method can be used to prove
a lower bound on parallel (discrete) search. We give here an alternative,
information-theoretic lower bound argument.

THEOREM 3.2 At least $\Omega(\lg(n+1)/\lg(p+1))$ steps are required to solve (weakly)
on a CRCW-PTBM with p processors the range search problem for a table of size n.

PROOF: Let M be a CRCW-PTBM that solves the range search problem with p
processors in time T, and let $C_1,\ldots,C_q$ be the control registers of the PTBM. Let
$x_1,\ldots,x_n$ be a fixed ordered table of keys. We shall consider the behaviour of
the algorithm on this table, for varying y.

We assume w.l.o.g. that M fulfils the condition of Lemma 2.2. In particular
the identity of the inputs involved in a test is uniquely determined by the state
of the processor executing this test. We can therefore assume that only
nontrivial tests comparing y and $x_i$ for some i are performed.

The state of a PTBM at step t can be described by the values of the control
registers and the state of each processor at that step. We formally define
ID(y,t), the instantaneous description of the machine at time t for input y, to
be the tuple $\langle s_1,\ldots,s_p,c_1,\ldots,c_q \rangle$, where $s_i$ is the state of processor i, and $c_j$ is
the value of register $C_j$ when step t starts. ID(y,t+1) is uniquely determined by
ID(y,t) and by the outcomes of the comparisons performed at step t. Also, the
identity of the inputs involved in the comparisons performed at step t is
uniquely determined by ID(y,t). At step t at most p comparisons are performed,
and, since each of them compares y with a key of the ordered table, these
comparisons can have at most p+1 distinct joint outcomes. Thus, if $I_t$ is the
number of distinct instantaneous descriptions at step t we have $I_0 = 1$, and
$I_{t+1} \le (p+1) I_t$. Finally, since the problem has n+1 distinct outcomes,
$I_T \ge n+1$. It follows that $T \ge \log_{p+1}(n+1)$.
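
The counting step of this proof can be written out as a short chain of inequalities. The numerical values in the final comment are an illustrative choice and do not come from the text.

```latex
\begin{align*}
I_0 &= 1, \qquad I_{t+1} \le (p+1)\, I_t
  \;\Longrightarrow\; I_T \le (p+1)^T, \\
n+1 &\le I_T \le (p+1)^T
  \;\Longrightarrow\; T \ge \log_{p+1}(n+1) = \frac{\lg(n+1)}{\lg(p+1)}.
\end{align*}
% Example: n = 1023, p = 3 gives T >= log_4(1024) = 5.
```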


As any CREW-PTBM computation is also a valid CRCW-PTBM computation, the last
lower bound is also valid for CREW-PTBMs. Thus the complexity of searching a
table of size n by comparisons with p processors is $\Theta(\lg(n+1)/\lg(p+1))$ if
concurrent reads are supported, both for parallel machines that support
concurrent writes and for parallel machines that do not support concurrent
writes, and for both the weak and strong notions of solution.


4.  PARALLEL  SEARCHING  IN  THE  EREW  MODEL 

Concurrent  reads  may  occur  at  two  places  in  the  algorithm  given  in  Th.  3.1: 
A  key  may  be  accessed  simultaneously  by  more  than  one  processor,  and  the  shared 
information  on  the  search  interval  is  accessed  concurrently  by  all  the  processors 
at  each  iteration.  While  the  first  type  of  concurrent  reads  can  be  avoided  by 
careful  programming,  the  second  one  seems  to  be  essential  in  this  algorithm: 
Information  about  the  next  interval  where  iteration  proceeds  has  to  be  broadcast 
to  all  the  processors,  and  this  can  be  done  in  constant  time  only  if  all  the 
processors  can  simultaneously  fetch  the  same  value.  If  concurrent  reads  are 
forbidden  then  the  coordination  of  all  the  processors  at  each  iteration  will 
require $\Omega(\lg p)$ steps, and the running time of the algorithm will be $\Omega(\lg n)$,
with  practically  no  gain  obtained  from  parallelism.  It  turns  out  that  this  is 
essentially  the  best  performance  that  can  be  achieved  while  searching  by 
comparisons  on  a  parallel  machine  that  does  not  support  concurrent  accesses  to 
the  same  location  in  memory. 
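
The coordination cost mentioned above is matched by the familiar doubling scheme for broadcasting without concurrent reads. The Python sketch below only illustrates that upper-bound scheme and proves nothing about the lower bound; the function name `erew_broadcast` and the numbering of processors from 0 are assumptions made for the example. In each round every informed processor passes the value to one uninformed processor through a register used by that pair alone.

```python
def erew_broadcast(p):
    """Schedule of (sender, receiver) pairs, one list per round.

    Processor 0 holds the value initially. In a round with k informed
    processors, processor i (i < k) informs processor i + k.
    """
    schedule = []
    informed = 1
    while informed < p:
        pairs = [(i, i + informed) for i in range(informed) if i + informed < p]
        schedule.append(pairs)
        informed = min(p, 2 * informed)
    return schedule
```

The schedule has 3 rounds for p = 8 and 4 rounds for p = 9, so the number of rounds grows as lg p rounded up. Paying this cost in each of the roughly lg(n)/lg(p) iterations of the p-ary search gives a total of about lg n steps.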


THEOREM 4.1 $\Omega(\lg(n) - \lg(p))$ steps are required to (weakly) solve by
comparisons the range search problem for a table of size n on an EREW-PTBM with p
processors, even if all the data is initially replicated p times to allow
conflict-free access to it by each processor.

PROOF: The lower bound results from the lack of a mechanism to distribute
information instantaneously in the system. In order to prove it we have to trace
at  each  step  the  "information"  available  at  each  processor  and  at  each  control 
register.  Note  that  the  "information"  available  at  a  control  register  may  change 
even  if  this  register  is  not  written  upon,  as  the  fact  that  no  processor  assigned 
to  it  a  value  is  informative  in  itself  (see  [CD]). 

Let M be an EREW-PTBM with p processors that solves the range search problem
in time T. We assume w.l.o.g. that M fulfils the condition of Lemma 2.2. Let
$P_1,\ldots,P_p$ be the p processors, and $C_1,\ldots,C_q$ be the q control registers used by the
machine. As in the proof of Th. 3.2 we fix the value of the keys $x_1,\ldots,x_n$, and
consider the behaviour of the algorithm on inputs $x_1,\ldots,x_n,y$, for varying y.

We say that $y \equiv_t y'\ (P_i)$ (y and y' are indistinguishable at step t by $P_i$)
if the state of $P_i$ at step t in a computation starting with input y is identical
to the state of $P_i$ at step t in a computation starting with input y'. Similarly
$y \equiv_t y'\ (C_j)$ (y and y' are indistinguishable at step t by $C_j$) if the content of
$C_j$ at step t in a computation starting with input y is identical with the content
of $C_j$ at step t in a computation starting with input y'. The relations
$y \equiv_t y'\ (P_i)$ and $y \equiv_t y'\ (C_j)$ are equivalence relations over R.

We say that x is a critical point of the equivalence relation $\equiv$ if it is a
boundary point of an equivalence class, i.e. for each $\epsilon > 0$ there exist y, y'
such that $|x-y| < \epsilon$ and $|x-y'| < \epsilon$, and $y \not\equiv y'$. Clearly, if there is no critical
point in the segment [y,y'] then $y \equiv y'$. Note that all the critical points of
the two equivalence relations defined above are in the set $\{x_1,\ldots,x_n\}$.
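
Since the critical points all lie among the keys, they can be found by comparing the state on the two sides of each key. The Python sketch below illustrates this under the assumption that the state is constant on each open interval between consecutive keys; the function name `critical_points` and the representation of a state as an arbitrary Python value are illustrative choices.

```python
def critical_points(keys, state):
    """Keys at which `state` changes, for a state constant between keys.

    keys: the table x_1 < ... < x_n; state: function from y to a value
    standing for a processor state or a register content.
    """
    keys = sorted(keys)
    # One representative y per open interval: below the first key, between
    # consecutive keys, and above the last key.
    reps = ([keys[0] - 1]
            + [(a + b) / 2 for a, b in zip(keys, keys[1:])]
            + [keys[-1] + 1])
    return [k for k, left, right in zip(keys, reps, reps[1:])
            if state(left) != state(right)]
```

A processor whose state records only the outcome of a comparison of y with the key 20 has the single critical point 20; a state that never depends on y has none.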

Let $Cr(P_i,t)$ be the set of critical points associated with the equivalence
relation defined by $P_i$ at step t, and let $cr(P_i,t) = |Cr(P_i,t)|$. Let $Cr(C_j,t)$ be
the set of critical points associated with the equivalence relation defined by $C_j$
at step t, and let $cr(C_j,t) = |Cr(C_j,t)|$. Finally, let
$cr(t) = \sum_i cr(P_i,t) + \sum_j cr(C_j,t)$. We initially have no critical points, so that

$cr(0) = 0$.    (4.1)

If $y < x_i < y'$ then y and y' are associated with distinct outcomes of the problem
and therefore must be distinguished by some processor up to step T. Thus,

$\bigcup_i Cr(P_i,T) = \{x_1,\ldots,x_n\}$, and

$cr(T) \ge n$.    (4.2)

We shall end our proof by showing that

$cr(t+1) \le 3\,cr(t) + p$.    (4.3)

Indeed, 4.1 and 4.3 imply that

$cr(t) \le \frac{3^t - 1}{2}\, p$,

which together with 4.2 yields the inequality

$T \ge \log_3((2n/p)+1) \ge \log_3(n) - \log_3(p)$.
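
For completeness, the passage from the recurrence to the bound on T can be spelled out as follows; this is a routine induction on t, written here as a worked derivation.

```latex
\begin{align*}
cr(t+1) &\le 3\,cr(t) + p
          \le 3\cdot\frac{3^{t}-1}{2}\,p + p
          = \frac{3^{t+1}-1}{2}\,p
          && \text{(induction step, from (4.3))} \\
n \le cr(T) &\le \frac{3^{T}-1}{2}\,p
          \;\Longrightarrow\; 3^{T} \ge \frac{2n}{p} + 1
          && \text{(using (4.2))} \\
T &\ge \log_3\!\Bigl(\frac{2n}{p}+1\Bigr) \ge \log_3\frac{n}{p}
          = \log_3 n - \log_3 p .
\end{align*}
```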

The $cr(P_i,t)$ critical points associated with the equivalence relation
defined by $P_i$ at step t divide the real line into $cr(P_i,t) + 1$ open intervals
such that inputs y belonging to the same interval are equivalent. In particular,
if y and y' belong to the same interval then $P_i$ executes at step t the same
instruction for y and y'. Moreover, if this instruction is a test instruction
then it involves in both cases the same inputs. We can therefore classify these
intervals according to the type of instruction executed: "TEST" intervals, i.e.
intervals of y values for which the instruction executed by $P_i$ at step t is a
comparison, "READ $C_j$" intervals, "WRITE $C_j$" intervals, and "MOVE" intervals. Let
$t_i$ be the number of endpoints of "TEST" intervals, and $w_{ij}$ be the number of
endpoints of "WRITE $C_j$" intervals. Each critical point is counted exactly twice,
once for each of the two intervals it bounds. Thus

$2\,cr(P_i,t) \ge t_i + \sum_j w_{ij}$.    (4.4)

If $y_1$ and $y_2$ belong to the same "TEST" interval I, and $P_i$ executes at step t
for inputs y in this interval a comparison involving y and $x_k$ then
$y_1 \equiv_{t+1} y_2\ (P_i)$, unless $x_k$ separates $y_1$ from $y_2$. Thus, $I \cap Cr(P_i,t+1) \subseteq \{x_k\}$.

If $y_1$ and $y_2$ belong to the same "READ $C_j$" interval I then $y_1 \equiv_{t+1} y_2\ (P_i)$,
unless $y_1 \not\equiv_t y_2\ (C_j)$. Thus $I \cap Cr(P_i,t+1) \subseteq Cr(C_j,t)$.

If $y_1$ and $y_2$ belong to the same "WRITE $C_j$" or "MOVE" interval then
$y_1 \equiv_{t+1} y_2\ (P_i)$.
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Cr(C,t)  • ^ '- 


R 


Cr(Pi,t)  1 i H 

Cr(P„t)  ^ 1 5 H 1 A 1 


Cr(C,t+l)  i 1- 


Cr(Pi,t+l)  1 1 \- i h 


CrCPo.t+l)  1 1 i 1 \ 1 f- 


-  "MOVE"  segment   R  -  "READ"  segment   T  -  "TEST"  segment   W  -  "WRITE"  segmei 


Figure  1 


Finally, if $y_1 \equiv_t y_2\ (C_j)$ then $y_1 \equiv_{t+1} y_2\ (C_j)$, unless some processor $P_i$
modifies $C_j$ at step t in a computation with input $y_1$ but does not modify $C_j$ at
step t of a computation starting with $y_2$, or vice versa, or modifies $C_j$ in both
computations, but assigns in each one a different value to $C_j$. It follows that
$y_1 \equiv_{t+1} y_2\ (C_j)$ unless the endpoint of a "WRITE $C_j$" interval associated with some
$P_i$ at step t separates $y_1$ from $y_2$.


We  can  therefore  conclude  that 

1. If $x \in Cr(P_i,t+1)$ then either x is an old critical point from $Cr(P_i,t)$,
or x is a critical point "copied" from $Cr(C_j,t)$, or x is a new critical point
created by a comparison. Moreover, comparisons may create at most one new
critical point in each "TEST" interval, and, since concurrent reads are not
allowed, a critical point from $Cr(C_j,t)$ can be copied into at most one set
$Cr(P_i,t+1)$.

2. If $x \in Cr(C_j,t+1)$ then either x is an old critical point from $Cr(C_j,t)$,
or x is an endpoint of a "WRITE $C_j$" interval associated with some $P_i$ at step t.

The  process  of  obtaining  the  partitions  associated  with  step  t+1  from  the 
partitions  associated  with  step  t  is  illustrated  in  Figure  1  for  one  control 
register  and  two  processors. 

The number of "TEST" intervals associated with $P_i$ is bounded by $(t_i + 2)/2$
($+\infty$ and $-\infty$ can also be endpoints of such intervals). We obtain therefore the two
inequalities

$\sum_i cr(P_i,t+1) \le \sum_i cr(P_i,t) + \frac{1}{2}\sum_i (t_i+2) + \sum_j cr(C_j,t)$

$\qquad = \sum_i cr(P_i,t) + \frac{1}{2}\sum_i t_i + \sum_j cr(C_j,t) + p,$  (4.5)

$cr(C_j,t+1) \le cr(C_j,t) + \sum_i w_{i,j}.$  (4.6)
Combining inequalities 4.4 - 4.6 we obtain

$cr(t+1) = \sum_i cr(P_i,t+1) + \sum_j cr(C_j,t+1)$

$\qquad \le \sum_i cr(P_i,t) + \frac{1}{2}\sum_i t_i + 2\sum_j cr(C_j,t) + \sum_{i,j} w_{i,j} + p$

$\qquad \le 3\,cr(t) + p.$  (4.7)

• 

Note  that  we  did  not  use  in  the  proof  the  fact  that  concurrent  writes  are 
not  allowed.  Indeed,  the  lower  bound  is  still  valid  for  an  ERCW  (exclusive  read 
concurrent  write)  parallel  machine. 

The last lower bound is optimal. The following simple method can be used to
search a list of length n with p processors: We divide the searched list into p
sublists each of length n/p, and perform with each processor a binary search on
one of the sublists. This algorithm can be implemented to run in time $O(\lg(n/p))$
without concurrent access to memory if the value of the searched key is initially
available at each processor. If the searched key is initially available at only
one memory location then lg(p) extra steps are required to make it available to
all the p processors, thus yielding an $O(\lg n)$ running time. It turns out indeed
that an $\Omega(\lg n)$ lower bound can be proved for the range search problem on an
EREW-PTBM if each input is initially stored in exactly one memory location.
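The partitioned search described above can be sketched as a sequential simulation. This is an illustrative sketch only: the name `partitioned_search`, the rule by which exactly one simulated processor "owns" the answer, and the rough step count (a doubling broadcast of the key followed by the longest binary search) are assumptions made for the example, not details taken from the paper. The keys are assumed sorted and distinct, with y different from every key and p no larger than n.

```python
from bisect import bisect_left
from math import ceil, log2


def partitioned_search(xs, y, p):
    """Return (range index of y, rough step count) for p simulated processors."""
    n = len(xs)
    assert 1 <= p <= n and y not in xs
    answer = None
    longest = 0
    for j in range(p):                       # each iteration plays one processor
        lo, hi = j * n // p, (j + 1) * n // p
        longest = max(longest, hi - lo)
        # Processor j owns the answer when y falls after the last key of
        # sublist j-1 and before the last key of sublist j.
        owns_left = (j == 0) or xs[lo - 1] < y
        owns_right = (j == p - 1) or y < xs[hi - 1]
        if owns_left and owns_right:
            answer = bisect_left(xs, y, lo, hi)   # binary search within xs[lo:hi]
    broadcast_steps = ceil(log2(p))          # doubling the number of copies of y
    search_steps = ceil(log2(longest + 1))   # binary search on one sublist
    return answer, broadcast_steps + search_steps
```

With 16 keys and 4 processors the count is 2 broadcast steps plus 3 search steps, illustrating how the lg(p) broadcast term brings the total back to the order of lg n.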

THEOREM 4.2 $\Omega(\lg(n))$ steps are required to solve by comparisons the range
search problem for a list of length n on an EREW-PTBM, independently of the
number of processors used.

PROOF: An informal argument supporting the last claim goes as follows. Suppose
that at least $c\,\lg(n/p)$ steps are needed to search in the EREW model a list of length n
with p processors that have conflict-free access to the searched key. Assume now
that the searched key y is stored in a unique location at the start of the
search. At most $2^k$ processors can access y and participate in the search within
k steps. The search must take therefore at least

$\min_k \max(k,\ c\,\lg(n/2^k)) = \frac{c}{c+1}\,\lg(n)$

steps.
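The value of the min-max expression in the informal argument can be checked numerically. The sketch below is illustrative only; the function name `informal_lower_bound` is an assumption, and k is restricted to integers, so the computed value can only be at or above the real-valued optimum $\frac{c}{c+1}\lg(n)$ attained at $k = c(\lg n - k)$.

```python
from math import log2


def informal_lower_bound(n: int, c: float) -> float:
    """Minimum over integer k >= 0 of max(k, c * lg(n / 2**k))."""
    best = float("inf")
    k = 0
    while 2 ** k <= n:
        best = min(best, max(k, c * (log2(n) - k)))
        k += 1
    return best
```

For n = 4096 and c = 1 the two terms balance at k = 6, which is exactly half of lg(n) = 12.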

The actual proof is more complex as the set of processors that have access
to the value of y in a given computation varies as a function of y. Nevertheless,
it is a simple modification of the proof given for Theorem 4.1. Let M be an
EREW-PTBM with p processors that solves the range search problem in time T. We
assume w.l.g. that M fulfils the condition of Lemma 2.2, and that only
nontrivial comparisons are performed. Let cr(t), $t_i$, and $w_{i,j}$ be defined as in
the proof of Theorem 4.1. Assume by contradiction that $T < \log_6(n)$. For each
value of the searched key y there are at most $2^T$ processors that can access y
during the T steps of a computation starting with input y. In particular, at each
step, each point x can be the endpoint of at most $2^T$ distinct "TEST" intervals.
The same holds true for the two points $-\infty$ and $+\infty$. The total number of "TEST"
intervals at a given step can not exceed therefore

$\frac{1}{2}\sum_i t_i + 2^T.$

Substituting back in inequality 4.5 we obtain

$\sum_i cr(P_i,t+1) \le \sum_i cr(P_i,t) + \frac{1}{2}\sum_i t_i + \sum_j cr(C_j,t) + 2^T.$  (4.5')

Inequality 4.7 is accordingly replaced by the inequality

$cr(t+1) \le 3\,cr(t) + 2^T.$  (4.7')

We obtain therefore that

$n \le cr(T) \le \frac{3^T - 1}{2}\cdot 2^T < 6^T,$

so that $T > \log_6(n)$, contradicting the assumption on T.


The  last  lower  bound  is  valid  even  if  concurrent  accesses  to  the  keys  from 
the  searched  table  are  allowed  (but  not  to  the  searched  key).  In  particular,  it 
is  valid  even  if  each  processor  has  its  own  private  copy  of  the  table. 


5.  PARALLEL  SEARCH  WITH  GENERALIZED  COMPARISONS 

The $\Omega(\lg(n+1)/\lg(p+1))$ lower bound on search by comparisons results from the
fact that it is not possible to perform irredundant independent comparisons:
Whereas p binary tests can partition the set of inputs into $2^p$ subsets, according
to the $2^p$ different combinations of the outcomes of the p tests, p independent
comparisons partition the set of inputs into p+1 subsets only. Thus, rather than
obtaining p bits of information at each step, one obtains only lg(p+1) bits of
information. This problem is avoided by using more general comparisons.
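The counting argument can be illustrated on a tiny instance. This is a hedged sketch, not part of the paper: `outcome_classes` is a hypothetical helper, and each comparison is modelled simply as the test y < x for one key x.

```python
def outcome_classes(keys, probes):
    """Distinct outcome vectors of the tests y < x, one test per key."""
    return {tuple(y < x for x in keys) for y in probes}


keys = [10, 20, 30]                 # p = 3 simultaneous comparisons
probes = [5, 15, 25, 35]            # one probe value per interval
classes = outcome_classes(keys, probes)
print(len(classes), "of", 2 ** len(keys), "outcome vectors occur")
```

Only p+1 = 4 of the $2^3 = 8$ conceivable outcome vectors can occur, because the outcomes of comparisons against ordered keys are monotone in y.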

THEOREM 5.1 If comparisons between polynomial functions of the inputs are
allowed, then it is possible to solve (weakly) the range search problem for a
table of size n in $O((\lg n)/p)$ steps with p processors on a CREW-PTBM; if the data
is replicated p times to allow conflict-free access to it then the same
performance can be achieved on an EREW-PTBM.

PROOF: We shall describe the algorithm informally, leaving to the reader the
details of an implementation on a PTBM. Let us assume w.l.g. that $n+1 = 2^k$. Let

$p_i(y,x_1,\ldots,x_n) = (x_i - y)(x_{i+1} - y)$, i=1,..,n-1,

$p_0(y,x_1,\ldots,x_n) = (y - x_1)$, and

$p_n(y,x_1,\ldots,x_n) = (x_n - y)$.

Note that $p_i(y,x_1,\ldots,x_n) < 0$ iff $x_i < y < x_{i+1}$. Thus if H is a subset of
{0,..,n} then $\prod_{i \in H} p_i < 0$ iff y is in the i-th range for some $i \in H$.

Let $H_j$, $1 \le j \le k$, be the subset of {0,..,n} consisting of all the numbers
with a zero in the j-th bit of their binary representation. If $0 \le i_1 < i_2 \le n$
then there exists at least one set $H_j$ such that $i_1 \in H_j$ but $i_2 \notin H_j$, or vice versa.
Thus, the outcomes of the k comparisons $\prod_{i \in H_j} p_i : 0$, j = 1,...,k, uniquely
determine the range of y.

The algorithm consists, therefore, of performing on each of the p processors
k/p of these comparisons, in $O(k/p)$ steps.
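The k polynomial tests of the proof can be sketched sequentially as follows. This is an illustration under stated assumptions rather than a PTBM implementation: the function names are hypothetical, bits are numbered from 0 instead of 1, the keys are sorted and distinct with y different from every key, and the k tests are evaluated one after another instead of being spread over p processors.

```python
from math import prod


def range_polynomials(y, xs):
    """Values p_0..p_n; p_i < 0 exactly when y lies in the i-th range."""
    n = len(xs)
    values = [y - xs[0]]                                   # p_0 = y - x_1
    values += [(xs[i - 1] - y) * (xs[i] - y) for i in range(1, n)]
    values.append(xs[n - 1] - y)                           # p_n = x_n - y
    return values


def search_by_polynomial_tests(y, xs):
    """Recover the range index of y from k = lg(n+1) sign tests."""
    n = len(xs)
    k = (n + 1).bit_length() - 1
    assert n + 1 == 2 ** k, "sketch assumes n + 1 is a power of two"
    p = range_polynomials(y, xs)
    index = 0
    for j in range(k):
        # H_j: range indices whose binary representation has a zero in bit j.
        h_j = [i for i in range(n + 1) if not (i >> j) & 1]
        if not prod(p[i] for i in h_j) < 0:
            index |= 1 << j          # y's range is not in H_j, so bit j is one
    return index
```

Each test touches about half of the keys, which is the cost the following paragraph points out is hidden by counting one comparison as one step.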


The performance of the algorithm is optimal, as the range search problem has
a sequential complexity of lg(n+1), and p processors can at most achieve a
speedup of p over the sequential algorithm. However, this algorithm assumes that
each processor accesses at each iteration at least half of all the keys in
constant time. We shall show in section 7 that if only one data item can be
accessed by each processor at a time then polynomial comparisons do not speed up
searching. Thus, the improved performance obtained by the algorithm of Theorem
5.1 is mainly a factitious effect of charging one time step for an operation
involving an unbounded number of inputs at once.
6. MACHINES WITH LOCAL MEMORY

In  a  large  parallel  machine  access  to  shared  memory  is  likely  to  be  much 
more  time  consuming  than  local  processing.  It  is  therefore  natural  to  measure 
the  time  complexity  of  algorithms  by  the  number  of  accesses  to  shared  memory, 
with  local  processing  being  free.  The  parallel  machine  model  can  be  modified  to 
reflect  this  approach:  Each  processor  has  its  own  local  memory;  It  can,  in  one 
step,  either  copy  an  item  from  shared  memory  into  its  private  memory,  or  copy  an 
item  from  its  private  memory  to  the  shared  memory.  In  addition,  it  can  perform 
after  the  global  access  operation  any  amount  of  local  computing. 

We specialize this general approach to the PTBM model. A Parallel Test and
Branch Machine with Local Memory (PTBML) is defined by modifying the PTBM model
in two ways: Firstly, in addition to shared data registers $D_j$ and shared control
registers $C_j$, local data registers $LD_{i,j}$ and local control registers $LC_{i,j}$ are
associated with each processor $P_i$. Secondly, we modify the set of instructions:
processor $P_i$ can execute instructions of one of the following types:

1. [...]

2. [...]

3. $C_k := LC_{i,j}$

4. $D_k := LD_{i,j}$

5. $LD_{i,j} : LD_{i,k}$ [...]

6. [...]

7. $LC_{i,j}$ [...]
Thus shared memory is accessed only in order to transfer information from
shared  memory  to  local  memory  and  back.  This  is  done  by  the  first  four  types  of 
instructions  which  are  global  instructions.  The  actual  processing,  that  is 
comparisons  and  reading  and  writing  of  control  registers  is  done  in  local  memory 
by  local  instructions. 

A  communication  cycle  for  a  processor  consists  of  a  sequence  of  steps, 
starting  with  a  step  where  a  global  instruction  is  executed,  and  ending  when  the 
next  global  instruction  or  a  final  state  is  reached  (we  assume  that  the  first 
instruction  executed  by  each  processor  is  global).  The  processors  are 
synchronized  at  the  communication  cycle  level:  during  one  time  interval  each 
processor executes one communication cycle, and the accesses to shared memory
occurring during this interval are considered to be concurrent. The computation
starts  with  inputs  being  stored  in  shared  memory. 

Note  that  PTBML's  like  PTBM's  come  in  three  flavours,  namely  EREW-PTBML, 
CREW-PTBML  and  CRCW-PTBML,  according  to  the  rules  governing  concurrent  access  to 
shared  memory. 

It  turns  out  that  the  ability  to  do  an  unrestricted  amount  of  local 
computing  does  not  speed  up  search.  Indeed,  if  the  value  of  the  searched  key  is 
available at each processor then at most one useful comparison can be done on
each  new  key  read  into  local  memory,  namely  comparing  it  to  the  searched  key. 
Thus,  if  the  value  of  the  searched  key  can  be  distributed  to  all  the  processors 
in  constant  time  then  a  PTBM  (with  no  local  registers)  can  simulate  in  "real 
time"  a  PTBML  on  the  search  problem.   While  this  is  not  strictly  true  in  the  EREW 
model where the distribution of the searched key takes $\Omega(\lg p)$ steps, the lower
bound of Theorem 4.2 nevertheless applies to PTBML's as well.

THEOREM 6.1 (i) $\Omega(\lg(n)/\lg(p))$ communication cycles are required to solve
(weakly) the range search problem on a CRCW-PTBML with p processors.

(ii) $\Omega(\lg(n) - \lg(p))$ communication cycles are required to solve (weakly) the
range search problem on an EREW-PTBML with p processors, even if the inputs are
replicated.

(iii) $\Omega(\lg n)$ communication cycles are required to solve (weakly) the range search
problem on an EREW-PTBML if the inputs are not replicated, independently of the
number of processors.

PROOF: Let M be a PTBML with p processors that solves the range search problem
in time T. The simulation described in Lemma 2.2 can be performed for PTBML's as
well. We can therefore assume w.l.g. that the identity of the data item accessed in
a data move or test instruction is "known" to the processor performing that
instruction. We can also assume that each processor performs at each cycle all
the comparisons on the data items stored in the local data registers whose
outcomes are not yet known, and preserves the result of these comparisons in its
state. In a CRCW-PTBML, or an EREW-PTBML with replicated inputs, we can also
assume that the searched key is copied into a local register by each processor at
the first cycle, and not read from shared memory at subsequent cycles.

Let us define $ID(P_i,t)$, the instantaneous description of processor $P_i$ at step
t, to be the tuple $\langle s,c_1,\ldots,c_r \rangle$, where s is the state of the finite-state control
of $P_i$ and $c_1,\ldots,c_r$ is the content of the local control registers at the start of
cycle t.

$ID(P_i,t+1)$ depends uniquely on $ID(P_i,t)$, and

1. If $P_i$ copies into local memory a shared control register at cycle t, then
on the value of that register.

2. If $P_i$ copies into local memory a shared data register at cycle t, then on
the outcome of all the comparisons between the new data item copied and the
data items previously available in local memory.

Note however that our assumptions on M imply that a new data item copied
into local memory must be a key $x_j$, and at most one nontrivial comparison can be
performed between this new data item and the data items previously available in
local memory, namely the comparison $x_j : y$. It is therefore possible to simulate
M by a PTBM M': The set of states of processor $P'_i$ of M' is the set of
instantaneous descriptions of processor $P_i$ of M; the local data registers of $P_i$
are replaced by a distinct set of shared registers in M'; each communication
cycle of M is simulated by two steps of M'; if $P_i$ transfers an input $x_j$ from
shared memory in that step then $P'_i$ copies $x_j$ and performs the comparison $y : x_j$;
if $P_i$ writes onto a shared data register then $P'_i$ performs the corresponding MOVE
instruction; and if $P_i$ reads from or writes onto a shared control register then
$P'_i$ performs the same operation. We leave to the reader the details of this
simulation. This completes the proof of (i) and (ii).

The  last  argument  does  not  extend  in  a  straightforward  manner  to  (iii):  the 
searched  key  y  is  not  initially  available  at  each  processor,  and  therefore,  when 
it  is  read  in  there  might  be  several  comparisons  to  be  done  in  one  communication 
cycle. We shall show however that the $\Omega(\lg n)$ lower bound applies even if we
assume  that  the  searched  key  is  available  from  the  first  step  at  each  processor 
that  eventually  gains  access  to  it. 


Let us define a PTBM with a k-oracle for the search problem as follows: Let
p be the number of processors and let $y,x_1,\ldots,x_n$ be the inputs to the problem.
p data registers are associated with the input y. Each input $x_i$ is initially stored
in one register. The searched key y is stored, however, in k of the registers
associated with it, the choice of these registers being done by the oracle.
These k registers are accessed only for tests, so that the searched key y is not
copied to other registers during the computation. A PTBML with a k-oracle is
defined in a similar manner: each input $x_1,\ldots,x_n$ is stored in one shared data
register, whereas the searched key is stored in the local registers of k
processors chosen by the oracle; no processor copies y to shared memory during
the computation. We have the following facts.

A. If the range search problem can be solved in T communication cycles by an
EREW-PTBML M with p processors then it can be solved in cT communication cycles by a
p-processor EREW-PTBML with a $2^T$-oracle.

Note first that at most $2^T$ processors of M will have access to the searched
key y. We first construct an EREW-PTBML M' with p processors that solves the
range search in time cT and has the property stated in Lemma 2.2: A processor
"knows" the identity of the input stored in each data register it accesses.
Next, we simulate M' with a p-processor EREW-PTBML M'' with a $2^T$-oracle. The
oracle figures out which processors in M' are going to access the key y, and
stores in the local memory of each corresponding processor in M'' a copy of y;
whenever a processor in M' reads from shared memory a data register containing y
the corresponding processor in M'' reads y from its local memory.
B. If the range search problem can be solved by a p-processor EREW-PTBML
with  a  k-oracle  in  time  T  then  it  can  be  solved  in  time  T  by  an  EREW-PTBM  with  a 
k-oracle. 

Since  each  processor  that  ever  accesses  y  has  y  in  local  memory  from  the 
start  of  the  computation,  the  simulation  used  in  the  proof  of  (i)  and  (ii)  works 
here  as  well. 

C. $\Omega(\lg n - \lg k)$ steps are needed to solve the range search problem with an
EREW-PTBM with a k-oracle, independently of the number of processors used.

The proof of Theorem 4.2 extends immediately to EREW-PTBM's with oracles.
We actually show there that if an EREW-PTBM solves the range search problem in
time T, and for each value of the inputs only k processors have access to the
searched key y (where the set of processors accessing y may depend on the input
values) then $n \le \frac{3^T - 1}{2}\,k$, so that $T \ge \log_3(n) - \log_3(k)$.


(iii)  follows  now  from  A,  B  and  C. 
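The way facts A, B and C combine can be spelled out as a short calculation. This is a sketch under assumptions: c denotes the constant of fact A, and a > 0 is a name introduced here for the constant hidden in the $\Omega$ of fact C.

```latex
% T: communication cycles used by the EREW-PTBML M.
% A and B give an EREW-PTBM with a 2^T-oracle running in time cT;
% C, applied with k = 2^T, bounds that running time from below.
\begin{align*}
  cT &\ge a\,(\lg n - \lg 2^{T}) = a\,(\lg n - T), \\
  (a + c)\,T &\ge a \lg n, \\
  T &\ge \frac{a}{a+c}\,\lg n = \Omega(\lg n).
\end{align*}
```

The resulting bound does not mention p, which is why (iii) holds independently of the number of processors.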


7.  POLYNOMIAL  COMPARISONS  IN  THE  PTBML  MODEL 

We  have  shown  in  section  5  that  searching  in  the  CREW-PTBM  model  can  be  sped 
up  by  a  factor  of  p  using  p  processors,  if  polynomial  functions  of  the  inputs  can 
be  tested.  The  computational  model  for  which  this  improved  algorithm  was 
obtained   is  not  realistic  in  that  it  allows  an  unlimited  number  of  data  items  to 
be processed in one comparison that takes one unit of time. Since we allow on a
PTBML  an  unlimited  amount  of  local  processing  in  between  accesses  to  shared 
memory,  it  is  natural  to  permit  more  powerful  tests.  Polynomial  comparisons  in 
this  model  do  not  increase  factitiously  the  number  of  accesses  to  shared  memory 
that  can  be  executed  in  one  cycle. 

It  turns  out  that  polynomial  comparisons  do  not  save  communication  steps 
when  solving  the  range  search  problem  on  a  PTBML,  and  the  lower  bounds  proved  in 
the  last  section  are  still  valid.  We  shall  show  that  if  M  is  a  PTBML  that  solves 
the  range  search  problem  using  polynomial  comparisons,  then  one  can  replace  the 
polynomial  comparisons  in  M  by  normal  comparisons,  obtaining  a  PTBML  M'  that 
simulates  faithfully  M  on  a  selected  set  of  inputs.  This  set  of  inputs  is  rich 
enough to guarantee that a machine that solves correctly the range search problem
for  these  inputs,  solves  it  correctly  for  all  valid  inputs. 

We  first  show  that  an  algorithm  for  the  range  search  problem  that  uses  only 
normal  comparisons  can  be  certified  correct  by  running  it  on  n+1  test  cases 
corresponding  to  the  distinct  n+1  outcomes  of  the  problem. 

LEMMA 7.1 Let $\xi_i$, $\eta_j$ be numbers such that $\eta_0 < \xi_1 < \eta_1 < \ldots < \xi_n < \eta_n$, and let M be
a machine that solves in time T using only normal comparisons the range search
problem for the n+1 input tuples $\langle \eta_j,\xi_1,\ldots,\xi_n \rangle$, j=0,...,n. Then M solves in
time T the range search problem for any valid tuple of inputs.

PROOF: Let $\langle y,x_1,\ldots,x_n \rangle$ be a tuple of inputs such that $x_j < y < x_{j+1}$. The
outcome of each test, and therefore the sequence of instructions executed by each
processor in a computation of M with inputs $y,x_1,\ldots,x_n$ are identical to those
obtained in a computation with inputs $\eta_j,\xi_1,\ldots,\xi_n$. In particular, a final state
is reached by M in a computation with inputs $y,x_1,\ldots,x_n$ after no more than T
steps, and this state is associated with the correct outcome.
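The key observation of the proof, that normal comparisons see only the relative order of the inputs, can be illustrated as follows. This is a hedged sketch: `comparison_outcomes` is a hypothetical helper, and a "normal comparison" is modelled here as a strict less-than test between two inputs.

```python
def comparison_outcomes(y, xs):
    """Outcomes of every pairwise test a < b among the inputs y, x_1..x_n."""
    values = [y] + list(xs)
    return [a < b for a in values for b in values]


# y = 17 lies between x_1 = 10 and x_2 = 20, just as eta_1 = 1.5 lies
# between xi_1 = 1 and xi_2 = 2, so every comparison has the same outcome.
same_range = comparison_outcomes(17, [10, 20, 30]) == comparison_outcomes(1.5, [1, 2, 3])
other_range = comparison_outcomes(25, [10, 20, 30]) == comparison_outcomes(1.5, [1, 2, 3])
print(same_range, other_range)
```

A machine driven only by such outcomes therefore cannot distinguish an arbitrary valid input from the test tuple that falls in the same range.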


We proceed now to show that for any machine that solves the range search
problem using polynomial comparisons it is possible to find numbers
$\eta_0 < \xi_1 < \ldots < \xi_n < \eta_n$ such that the outcome of the computation on inputs $x_i = \xi_i$,
$y = \eta_j$ can be determined using only normal comparisons. The key technical result
is provided by the following lemma.

LEMMA 7.2 Let $\Pi$ be a finite family of nonzero real polynomials in the
variables $y,x_1,\ldots,x_n$. Define var(p) to be the set of variables the polynomial p
depends upon. There exists an increasing sequence of numbers
$\eta_0 < \xi_1 < \eta_1 < \ldots < \xi_n < \eta_n$ such that the following conditions are satisfied for
any $p \in \Pi$:

(i) $p(\eta_j,\xi_1,\ldots,\xi_n) \ne 0$, j=0,...,n.
(ii) If $x_j \notin var(p)$ then $sign\ p(\eta_{j-1},\xi_1,\ldots,\xi_n) = sign\ p(\eta_j,\xi_1,\ldots,\xi_n)$.
(iii) If $\eta > \eta_n$ then $sign\ p(\eta,\xi_1,\ldots,\xi_n) = sign\ p(\eta_n,\xi_1,\ldots,\xi_n)$.

PROOF: We shall prove the lemma by induction on n. If n = 0 then (ii) is
emptily true. Each polynomial $p \in \Pi$ has a finite number of zeroes, and the
validity of (i) and (iii) can be guaranteed by choosing $\eta_0$ to be larger than any
of these.

We assume now that the claim is valid for n-1 and proceed to prove it for n.
Let

$\Pi_0 = \{p \in \Pi : x_n \notin var(p)\}$, and let $\Pi_1 = \Pi \setminus \Pi_0$.

Assume that $\Pi_1 = \{p_1,\ldots,p_r\}$, where

$p_i = \sum_{j=0}^{d_i} q_{i,j}(y,x_1,\ldots,x_{n-1})\,x_n^j,$

and $q_{i,d_i} \ne 0$. Let

$\Pi' = \Pi_0 \cup \{q_{1,d_1},\ldots,q_{r,d_r}\}.$

Let $\eta_0 < \xi_1 < \ldots < \xi_{n-1} < \eta_{n-1}$ be chosen such that (i) - (iii) are fulfilled
for the set $\Pi'$ of polynomials. Let

$d = \max_i d_i,$

$C = \max_{1 \le i \le r}\ \max_{0 \le j \le d_i}\ \max_{0 \le k < n} |q_{i,j}(\eta_k,\xi_1,\ldots,\xi_{n-1})|.$

Finally, let

$c = \min_{1 \le i \le r}\ \min_{0 \le k < n} |q_{i,d_i}(\eta_k,\xi_1,\ldots,\xi_{n-1})|.$

Note that c > 0. Choose $\xi_n$ such that

$\xi_n > \max\{\eta_{n-1},\ (d+1)C/c\},$

and choose $\eta_n$ to be larger than $\xi_n$ and than any zero of the polynomials
$p(y,\xi_1,\ldots,\xi_n)$, where $p(y,x_1,\ldots,x_n) \in \Pi_1$.

Let $p \in \Pi_0$. By the inductive assumption (i) and (ii) are satisfied for p
and $\eta_j$, j=0,...,n-1. Also, if $\eta > \eta_n$ then $\eta > \eta_{n-1}$, and

$sign\ p(\eta,\xi_1,\ldots,\xi_{n-1}) = sign\ p(\eta_{n-1},\xi_1,\ldots,\xi_{n-1}) = sign\ p(\eta_n,\xi_1,\ldots,\xi_{n-1}).$

Thus, (i) and (ii) are satisfied for p and $\eta_n$, and (iii) is satisfied for p.

Let $p_i \in \Pi_1$, where

$p_i = \sum_{j=0}^{d_i} q_{i,j}(y,x_1,\ldots,x_{n-1})\,x_n^j.$

Our definition of $\xi_n$ implies that for any k, $0 \le k < n$,

$|q_{i,d_i}(\eta_k,\xi_1,\ldots,\xi_{n-1})\,\xi_n^{d_i}| \ge c\,\xi_n^{d_i} > dC\,\xi_n^{d_i-1} \ge \left|\sum_{j=0}^{d_i-1} q_{i,j}(\eta_k,\xi_1,\ldots,\xi_{n-1})\,\xi_n^j\right|.$

It follows that $sign\ p_i(\eta_k,\xi_1,\ldots,\xi_n) = sign\ q_{i,d_i}(\eta_k,\xi_1,\ldots,\xi_{n-1})$. Thus (i)
and (ii) are valid, by the inductive assumption, for $\eta_0,\ldots,\eta_{n-1}$. The definition
of $\eta_n$ entails that (i) and (iii) are satisfied for $\eta_n$, whereas (ii) is emptily
true.


THEOREM 7.3 Assume that the range search problem can be solved on an EREW
(CREW, CRCW) PTBML with p processors in T communication steps, using polynomial
comparisons. Then the range search problem can be solved on an EREW (CREW, CRCW)
PTBML in 2T communication cycles using only normal comparisons.

PROOF: Let M be a PTBML that solves the range search problem in T communication
steps. Using the construction of Lemma 2.2 it is possible to obtain a PTBML M'
that solves the range search problem in at most 2T communication steps, with the
property that a processor "knows" the identity of the data items which are involved
in each comparison it performs. Thus, we may assume w.l.g. that each comparison
performed by a processor of M' is of the form $p(y,x_{i_1},\ldots,x_{i_r}) : 0$, where
$y,x_{i_1},\ldots,x_{i_r}$ are inputs that are available in the local memory of the processor.

Let $\Pi$ be the set of polynomials corresponding to the comparisons occurring in
instructions of M'. Choose a set of points $\eta_0 < \xi_1 < \ldots < \xi_n < \eta_n$ such that the
conditions of Lemma 7.2 are fulfilled. Consider now the behavior of M' on inputs
of the form $\langle y,x_1,\ldots,x_n \rangle = \langle \eta_j,\xi_1,\ldots,\xi_n \rangle$. Conditions (i) and (ii) of Lemma 7.2
imply that the outcome of a comparison $p(y,x_{i_1},\ldots,x_{i_r}) : 0$ depends only on the
relative position of y with respect to $x_{i_1},\ldots,x_{i_r}$. This can be checked by
comparing y to each of $x_{i_1},\ldots,x_{i_r}$. One can replace in M' each polynomial
comparison by a sequence of normal comparisons. The resulting PTBML M'' will
simulate M' faithfully on inputs of the form $\langle \eta_j,\xi_1,\ldots,\xi_n \rangle$, i.e. will go
through the same sequence of communication steps and halt in the same state as
M'. M'' solves in time 2T the range search problem for the n+1 tuples of inputs
$\langle \eta_j,\xi_1,\ldots,\xi_n \rangle$, j=0,...,n, and therefore, by Lemma 7.1, M'' solves the range
search problem in 2T communication steps for any valid tuple of inputs.


COROLLARY 7.4 (i) At least $\Omega(\lg(n)/\lg(p))$ communication steps are needed to
solve the range search problem for a table of size n on a CREW-PTBML with p
processors, even if polynomial comparisons are used.

(ii) At least $\Omega(\lg(n))$ communication steps are needed to solve the range search
problem for a table of size n on an EREW-PTBML, independently of the number of
processors, even if polynomial comparisons are used.


8.  CONCLUSION 

Searching  by  comparisons  provides  us  with  a  natural   example   of  a  problem 
with  two  features: 

1.  It  can  be  solved  faster  on  a  parallel  machine  that  supports  concurrent 
reads  than  on  a  parallel  machine  where  concurrent  access  to  the  same 
location  in  memory  is  denied. 

2.  It  can  not  be  sped  up  by  parallelism,  if  concurrent  accesses  to  the  same 
memory  register  are  not  supported. 
As mentioned in chapter 2, parallel machines with a fixed geometry can be
seen as a restricted class of EREW shared-memory parallel machines. Thus, the
$\Omega(\lg n)$ lower bound on communication steps proven for search in Theorem 6.1 is
valid  for  any  fixed-geometry  parallel  machine. 

Communication  is  likely  to  dominate  the  performance  of  future  large  parallel 
machines.  It  is  therefore  worthwhile  to  characterize  the  parallel  complexity  of 
problems  in  terms  of  the  number  of  communication  steps  required,  ignoring 
computational  steps.  We  did  that  in  section  6  for  parallel  searching,  and  showed 
that  the  parallel  time  complexity  for  searching  is  determined  by  the  number  of 
communication  steps  required.  It  would  be  fruitful  to  apply  the  same  approach  on 
the  analysis  of  other  parallel  algorithms.  In  particular,  are  there  natural 
problems  for  which  there  is  a  tradeoff  between  the  amount  of  communication  and 
the  amount  of  computation  needed  to  solve  them? 

The use of more powerful computational primitives (namely, polynomial
comparisons)  did  not  help  in  reducing  the  number  of  communication  steps  needed 
for  searching.  Are  there  natural  problems  for  which  such  reduction  can  be 
achieved?  Can  such  reduction  be  achieved  for  searching  using  yet  more  powerful 
primitives? 

A  more  realistic  model  for  an  EREW  parallel  machine  might  be  a  machine  where 
concurrent  access  to  the  same  memory  cell  does  not  generate  an  error,  but  rather 
results  in  one  request  being  satisfied,  the  other  requests  being  rejected  (with  a 
special  busy  signal  being  returned),  or  queued.  Can  search  by  comparisons  be 
sped up by parallelism on such a machine? Note that a processor whose request to
shared  memory  has  been  rejected  gets  some  information  from  the  rejection  itself. 
As noted in section 3, the $O(\lg(n)/\lg(p))$ searching algorithm can be
implemented  on  any  EREW  shared  memory  parallel  machine  where  one  processor  has 
the  ability  to  broadcast  messages  to  all  the  other  processors  in  constant  time  (a 
BEREW  machine?).  If  all  the  processors  share  this  broadcasting  ability  (only  one 
broadcast  is  allowed  at  a  time)  then  this  algorithm  can  be  implemented  even  in 
the  absence  of  shared  memory  (this  requires,  however,  that  some  table  keys  be 
duplicated  and  assumes  that  these  keys  are  correctly  distributed  over  the  local 
memories  of  the  distinct  processors).  We  have  here  yet  another  model  of 
parallelism,  corresponding  to  a  bus-oriented  architecture.  How  does  it  compare 
with  the  other  models? 

In  a  real  parallel  machine  memory  is  likely  to  be  organized  into  modules 
with  exclusive  access  being  enforced  at  the  level  of  the  memory  module  rather 
than at the level of the memory cell. How does that affect the performance of
parallel  algorithms?  The  work  of  [BDL]  is  a  useful  start  in  the  investigation  of 
such  systems. 

Finally,  this  paper  provides  a  framework  for  the  study  of  parallel 
comparison-based  algorithms,  which  accounts  for  the  coordination  overheads.  It 
is  hoped  that  this  framework  will  be  helpful  in  the  study  of  other  problems  such 
as  merging  and  sorting. 